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Recent advances in high-throughout sequencing technologies have made it possible to 
accurately assign copy number (CN) at CN variable loci. However, current analytic methods 
often perform poorly in regions in which complex CN variation is observed. Here we 
report the development of a read depth-based approach, CNVrd2, for investigation of 
CN variation using high-throughput sequencing data. This methodology was developed 
using data from the 1000 Genomes Project from the CCL3L1 locus, and tested using 
data from the DEFB103A locus. In both cases, samples were selected for which paralog 
ratio test data were also available for comparison. The CNVrd2 method first uses 
observed read-count ratios to refine segmentation results in one population. Then a 
linear regression model is applied to adjust the results across multiple populations, in 
combination with a Bayesian normal mixture model to cluster segmentation scores into 
groups for individual CN counts. The performance of CNVrd2 was compared to that of two 
other read depth-based methods (CNVnator, cn.mops) at the CCL3L1 and DEFBW3A loci. 
The highest concordance with the paralog ratio test method was observed for CNVrd2 
(77.8/90.4% for CNVrd2, 36.7/4.8% for cn.mops and 72/1% for CNVnator at CCL3L1 
and DEF103A). CNVrd2 is available as an R package as part of the Bioconductor project: 
http://www.bioconductor.org/packages/release/bioc/html/CNVrd2.html. 
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INTRODUCTION 

Copy number variation (CNV) encompassing genes is a com- 
mon phenomenon in the human genome, and has been shown to 
be associated with variation in phenotype (Gonzalez et al., 2005; 
Freeman et al, 2006; McKinney et al, 2008, 2010, 2012; Bentley 
et al., 2010). Accurate CN assignment is required for studies asso- 
ciating CNV loci with phenotype. However, accurately measuring 
gene CN by direct molecular methods over large numbers of sam- 
ples is challenging, and is often complicated by the existence of 
paralogous gene pairs. As a consequence, phenotypic association 
data, largely obtained by quantitative polymerase chain reaction 
(Q-PCR) based methods, should be regarded with caution (He 
et al, 2009; Marques et al, 2010; Carpenter et al, 2011; McKinney 
and Merriman, 2012). In response to the increasing utility of 
high-throughput sequencing (HTS) technologies, we previously 
developed the read depth based method CNVrd (Nguyen et al., 
2013). We demonstrated its ability to accurately assign CN using 
genome-wide HTS data duplications and deletions at the FCGR 
locus, where CN ranges from zero to four. To date, however, the 
utility of this approach for genotyping loci at which a greater 
range of CN exists is untested. 

One complex CN variable locus is CCL3L1. Copy number 
variation at CCL3L1 has been associated with susceptibility to 
HIV infection (Liu et al., 2010), autoimmune disease (Burns 
et al, 2005; Mamtani et al., 2008; McKinney et al, 2008) and 



asthma (Lee et al, 2011). The median CN of CCL3L1 is 2 in 
European populations and >2 in other populations (Gonzalez 
et al., 2005). Evaluation of the role of CCL3L1 CNV in common 
disease has been hampered by robustness of methodology, par- 
ticularly that based on Q-PCR (He et al., 2009; Carpenter et al., 
2011). Similarly, CNV within the beta-defensin locus on chro- 
mosome 8, which includes the DEFB4 and DEFB103A genes that 
vary in CN en bloc (a range of two to nine copies) (Groth et al., 
2008), has been associated with various infectious and inflam- 
matory phenotypes (Bentley et al., 2010; Mehlotra et al, 2012; 
Zhou et al., 2012). As is the case for CCL3L1, however, inter- 
pretation of these data, and further progress in studying the role 
of beta-defensin CN in medical phenotypes, is hampered by the 
difficulty in accurately measuring CN in large numbers of sam- 
ples (Aldhous et al., 2010). Paralog ratio test (PRT) copy number 
data are available at both CCL3L1 and the beta-defensin locus 
(Armour et al., 2007; Walker et al., 2009), and thus provide an 
important resource for developing and validating new HTS read 
depth-based copy number assignment approaches. The PRT uses 
multiple probes to compare CN to an invariant paralog and is cur- 
rently regarded as the gold-standard method for measuring CN at 
complex loci (McKinney and Merriman, 2012). 

Here we describe the development of CNVrd2 as an extension 
of the CNVrd methodology, and demonstrate the application of 
this new approach by assigning copy number at the development 
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locus CCL3L1, and test locus DEFB103A, both of which have 
previously had CN measured using the PRT. 

SUBJECTS AND METHODS 
HTS AND MICROARRAY DATA USED 

The genomic coordinates (hgl9) used in this research 
were derived from the NCBI (http://www.ncbi.nlm.nih. 
gov/) and were CCL3 (chrl7:34415602-34417506), CCL3L1 
(chrl7:34623842-34625730), CCL3L3 (chrl7:34623842-3462 
5730), CCL4 (chrl7:34431220-34433014), CCL4L1 (chrl7:345 
38468-34540275), DEFB103A (chr8:7738726-7740105) and 
DEFB103B (chr8:7286410-7287682). Data were downloaded 
from the European Bioinformatics Institute (EBI) FTP server 
(ftp://ftp.1000genomes.ebi.ac.uk/voll/ftp/data/ and ftp://ftp.1000 
genomes.ebi.ac.uk/voll/ftp/phasel/data/). For CCL3L1, 2535 
samples (Table 1) downloaded from 26 populations on February 
10, 2014 were used. These had an average coverage of 2.8-40.6 x 
over a 2Mb region around the CCL3L1 locus. For DEFB103A, 
2535 samples (Table 2) downloaded on August 13, 2013 were 
used. These had an average coverage of 2.8-30.9 x (median 6.9 x ) 
over the 2 Mb region around the DEFB103A locus. Only samples 
sequenced on lUumina platforms were used, so as to avoid cross 
platform variability. BWA-aligned data were available in BAM 
format (Li et al, 2009). 

THE WORKFLOW OF THE PIPELINE 

The entire pipeline is represented in Figure 1. The pipeline is used 
to measure CN for a specific gene or locus. Four main steps are 
described below. 

PIPELINE WORKFLOW: IDENTIFICATION OF APPROXIMATE 
BOUNDARIES OF MEASURED REGIONS 

Samples were segmented as previously described for the CNVrd 
procedure (Nguyen et al, 2013). The results of the segmentation 
process were used to obtain approximate boundaries of copy- 
number variable regions encompassing the gene being measured. 
CNVrd2 used different percentiles and standard deviations of 
segmentation results to visually obtain information of CNVs at 
loci. Then, standard deviations were used to obtain approximate 
boundaries as follows (Step C, Figure 1). 

To identify approximate boundaries of putative polymorphic 
regions (Step C, Figure 1), the coordinates of regions having sim- 
ilar segmentation results for each sample were obtained from the 
segmentation process applied to the large 1-2 Mb regions across 
all samples. Next, all these coordinates were combined to generate 
sub-regions, and segmentation results generated by the original 
segmentation analysis for each sample were obtained. Finally, 
the standard deviations and various percentiles of segmentation 
results of each sub-region across the 2535 samples were calculated 
and plotted to visually confirm CNV. Windows exhibiting large 
standard deviations were used to identify approximate positions 
of putative regions being assigned CN. 

PIPELINE WORKFLOW: OBTAINING SEGMENTATION SCORES FOR 
INDIVIDUAL SAMPLES AT THE MEASURED LOCUS 

This step was similar to our previous pipeline (CNVrd) but with 
some modifications (Step D, Figure 1) added to obtain more 



reliable segmentation scores. If the entirety of the gene of inter- 
est was placed in a single segment then the segmentation score for 
the gene was the segmentation score for that region. If the gene 
was split into multiple segments, each having the same sign then 
the segmentation score for the gene was calculated as the length- 
weighted average across the segments (NA18612 for example in 
Step D, Figure 1). Otherwise, if a segment at the boundary of 
the gene had a different sign from other sub-regions of the gene 
(HG00254 for example) and the boundary segment's length was 
less than a specific threshold (here the threshold was set to half 
the window size) then the z-score for the boundary segment was 
calculated, where the z-score is defined as the value of the stan- 
dardized observed read-count ratio (Nguyen et al, 2013). If the 
z-score had the same sign as the segmentation scores from other 
sub-regions of the gene, then the segmentation score for the gene 
was again calculated as the length-weighted average across the 
other segments. For all other situations the gene of interest was 
assigned a segmentation score of zero, reflecting the population 
average at that locus. 

PIPELINE WORKFLOW: ADJUSTMENT OF THE SEGMENTATION 
PROCESS FOR MULTIPLE POPULATIONS 

The pipeline was applied to each of the five major populations 
(European, East Asian, West African, South Asian, and Americas) 
to obtain segmentation scores (SSP: the segmentation scores for 
a single major population). This identified the scores in each of 
the large populations, but did not allow comparisons between 
the populations. Therefore, we ran the segmentation process for 
all samples and obtained segmentation scores for samples in all 
populations [SSP{p): the segmentation scores for pooled popu- 
lations]. When we pooled samples and ran the pipeline, if the 
gene being analyzed was not segmented into a single region for a 
particular sample, then that sample was assigned a segmentation 
score of zero. For example, the median CCL3L1 CN for European- 
ancestry sample sets was less than other populations, therefore the 
majority of segmentation results and z-scores (where needed — 
see above) were less than zero. If the CCL3L1 gene region of a 
European-ancestry sample was segmented into two sub-regions, 
both having negative signs in the all-population analysis, then 
this could either be recapitulated in the single-population analy- 
sis (implying they were within the same CN region) or they could 
be sub-regions that could span the population average. Thus, 
they could have one negative and one positive sign (implying 
they were within two distinct CN regions). In this case, however, 
the pipeline only recognized that sub-regions shared the same 
sign, and the two regions would therefore be considered as one 
region and would be merged. To improve this, for each of the five 
major populations, we fitted a linear regression line with SSP and 
SSP(p) as independent and dependent variables respectively (data 
for CCL3L1 are shown in Figure 2). This step aimed to reduce 
errors when performing segmentation within single major popu- 
lations or pooled populations. The fitted mean values of the five 
single linear regression models were used as the final per sam- 
ple segmentation scores for these five populations. Segmentation 
scores were transformed and standardized, and a normal mix- 
ture model was used to cluster the scores into CN genotype 
groups. 
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1) Count reads m windows, then correct 
GC-content bias. 

2) Transform RCs to the same coverage. 

3) Standardize RCs across samples. 
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Segment into different regions. 
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1) Visually confirm the gene/region being 
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2 ) Obtain approximate boundaries of 
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1 ) Merge segmentation results inside regions 
being measured. 

2) Adjust segmentation results at boundariei 
being measured. 




Merge these 
segments based 
on a threshold 




Based on a 
threshold, test 
this boundary 
region whether 
it IS in a 
different region 



Use a simple linear regression to adjust 
segmentation results across different 
populations. 



Use: 

1) A simple normal mixture model 
Or 

2) A Bayesian normal mixture model 
for highly copy-number variable loci. 




Prior information obtained 
from results of small 
populations (red) 



S^^gmeutation score 



Obtain copy-number count 



FIGURE 1 I A schematic diagram of the pipeline. CNVrd2 is a 

modified version of CNVrd. CNVrd2 is identical to CNVrd at the 
counting, transforming, standardizing and segmenting steps (A,B,G: 
black text). However, CNVrd2 fias additional steps: identification of 
polymorphic regions (C), merging sub-regions inside genes/regions being 



measured and testing boundary regions (D), using a simple linear 
regression model to adjust segmentation scores between populations 
(E) and a Bayesian normal mixture to cluster segmentation scores of 
highly CN variable regions into different groups (F). These new steps 
are in blue text. 



PIPELINE WORKFLOW: ASSIGNING COPY-NUMBER COUNTS 

A normal mixture model was used to cluster final segmentation 
scores into different groups. If clear clusters were observed then 
the Expectation Maximization (EM) algorithm (Dempster et al., 
1977) was used in the clustering process. If data were compli- 
cated (e.g., loci with high copy-number variants such as CCL3L1 
and DEFB103A) then a Bayesian clustering approach was used 



(Step F, Figure 1; details of the analysis process for CCL3L1 and 
DEFB103A are described below). 

IDENTIFICATION OF WINDOW SIZE PARAMETERS FOR CNVrd2 USING 
TEST LOCI 

One of the new innovations in CNVrd2 was the merging of seg- 
mentation results inside regions being measured if these regions 
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Americas East Asian ancestry 
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FIGURE 2 I Linear relationsliip between thie segmentation scores called for single populations and for all populations at CCL3L1. SSR the 

segmentation scores of a single large population; SSP{p), the segmentation score of pooled populations. 



were divided into different parts (Step D, Figure 1). We merged 
these parts if their signal values were above (below) a thresh- 
old (namely "testThreshold2Merge" in the CNVrd2 package) for 
duplicated events (deleted events). Another parameter was the 
length of a boundary region, the default for which was set to 
"half window size." Real data sets were used for investigation of 
the parameters of CNVrd2. In order to make reliable data sets we 
intersected the CNV results of two sets of data (Conrad et al., 
2010; Campbell et al., 2011). From these intersecting data we 
chose eight loci whose copy-number assignments were >45% 
identical and whose lengths were in different ranges (~3, ~5, ~8, 
~20, ~24, ~45, ~60, ~100 kb). The coordinates and CN status of 
the 8 loci are described in Table 3. We only retained samples hav- 
ing identical results between the methods. We downloaded BAM 
files of the 1 Mb regions around these loci and let CNVrd2 auto- 
matically assign CN for the loci (CN groups of these loci were 
known in advance from the micro-array based results, Table 3). 

COMPARISON WITH CNVrd 

We compared the new pipeline with our previous pipeline at the 
eight loci mentioned above (Table 3) and the eleven loci previ- 
ously measured in Nguyen et al. (2013). For the eight loci, both 
pipeHnes were run and the percent concordances were calculated. 



For CNVrd2, we used different window sizes (0.1, 0.2, 0.5, 1, 2, 
5, 10, 20, and 50 kb) and different values (0.15, 0.25, 0.35, and 
0.45) of testThreshold2Merge. For CNVrd, we only used differ- 
ent windows (0.1, 0.2, 0.5, 1, 2, 5, 10, 20, and 50 kb) because the 
merging process was not implemented in CNVrd. For the eleven 
loci, we used a "testThreshold2Merge" value of 0.35 and the 
same window size as CNVrd [FCGR3A and FCBR3B (1000 bp), 
RHD (2000 bp), UGT2B17 (500 bp), GSTTl (1000 bp), IGLL3P 
(1000 bp), SMN2 (1000 bp), GSTMl (200 bp), CFHRl (500 bp), 
CNTNAP3 (2000 bp) and 7GIL5 (1000 bp)]. 

APPLICATION OF CNVrd2 TO TWO COMPLEX LOCI: CCL3L1 AND 
DEFB103A 

Windows of 500 and 1000 bp were used to analyze CCL3L1 
and DEFB103A, respectively. Steps described above were used 
to obtain putative boundaries of CCL3L1/DEFB103A gene- 
containing regions and final segmentation scores for the two 
loci. For each locus, one population having clear clusters of 
segmentation scores was used to obtain prior information for 
all populations: the European and South Asian population for 
CCL3L1 and DEFB103A, respectively. This approach was imple- 
mented using the EM algorithm (Dempster et al., 1977). The 
results for the European and South Asian populations (means 
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Table 3 | Eight loci which were obtained from the intersection of results of Conrad et al. (2010) and Campbell et al. (2011) were used to obtain 
parameters for CNVrd2 and to compare between CNVrd2 and CNVrd. 



Locus 


Region downloaded 


Sample size° 


CN"" 


chr7:141769627-141793931 


7:141000000-142000000 


258 


0 (3; 1.2) 1 (44; 171) 2 (208; 80.6) 3 (3; 1.2) 


chr3:162514938-162619146 


3:162000000-163000000 


136 


0 (93; 68.4) 1 (28; 20.6) 2 (15; 11.0) 


chrl7:44212815-44270230 


17:43500000-44500000 


252 


2 (195; 774) 3 (38; 15.1) 4 (19; 75) 


chr1:110222301-110242933 


1:109500000-110500000 


251 


2 (106; 42.2) 3 (102; 40.6) 4 (43; 171) 


chr17:44212815-44270230 


17:43500000-44500000 


252 


2 (195; 774) 3 (38; 15.1) 4 (19; 75) 


chr2: 7933 1533-79339762 


2:79000000-80000000 


231 


2 (218; 94.4) 3 (13; 5.6) 


chrl6:72109587-72112297 


16:71500000-72500000 


251 


2 (226; 90.0) 3 (25; 10.0) 


chr3:26434104-26439360 


3:26000000-27000000 


259 


0 (3; 1.2) 1 (14; 5.4) 2 (240; 92.7) 3 (2; 0.8) 



^ These samples result from the intersection of the samples of Conrad et al. and Campbell et al. (Conrad et al., 2010; Campbell et al., 201 1) and the 1000 Genomes 
Project. 

^in parentheses are the number and percentage of samples with the specified CN, as obtained from Conrad et al. (2010) and Campbell et al. (2011). 



and variances of groups and the distances between the means 
of the groups) were used to provide prior information for a 
Bayesian normal mixture model that was applied to the other 
populations. This was done to improve model stability and fit 
for populations exhibiting high levels of CN polymorphism. The 
Markov chain Monte Carlo sampling method implemented in 
the rjags package (Plummer, 2011) was used to obtain poste- 
rior estimates for each normal mbcture model parameter — this 
allows control of parameters using prior information. An adaptive 
phase of 100 iterations was run, followed by a burn-in period 
of 1000 iterations. Next, we ran 20,000 iterations and calculated 
the means, standard deviations, and proportions of the mix- 
ture components from 20% of the iterations. Convergence was 
assessed by using Heidelberger and Welch's stationarity and half- 
width test implemented in the coda package (Plummer et al., 
2006). 

COMPARISON WITH OTHER METHODS: CCL3L1 

Copy number assignments at CCL3L1 by cn.Mops (Klambauer 
et al, 2012), CNVnator (Abyzov et al, 2011) and Sudmant et al. 
(2010) were compared to assignments of CNVrd2 on the 180 
samples measured by the modified PRT. To run CNVnator, we 
downloaded all of the chromosome 17 data for the 180 samples. 
CNVnator was applied as previously described (Nguyen et al., 
2013). cn.Mops uses read-count information of single samples 
and multiple samples to detect CN variable regions and uses 
a Poisson mixture model to automatically infer CN. We used 
CNVrd2 to obtain matrices of read counts in the 2Mb region 
(chrl7:33670000-34670000) in eight different scenarios: constant 
windows of 25000, 20000, 10000, 5000, 2000, 1000, 500, and 
200 bp with default values for other parameters. The 5000 bp win- 
dow for cn.Mops had the highest proportion of samples having 
CN > 3 and the highest concordance with other methods and 
was used in the Figure 3 comparison. The Sudmant et al. (2010) 
read count-based method utilized all possible mapping locations 
of a read combined with singly unique nucleotide positions to 
measure CN for 169 samples from the 1000 Genomes Project 
at different loci, including the CCL3L1 gene. The CCL3L1 gene 
results were validated by using Q-PCR based assays with high cor- 
relation being observed (r = 0.95) (Sudmant et al., 2010). Ill of 



the 169 samples overlapped with the samples analyzed here using 
CNVrd2, and by Carpenter et al. (2011) using the PRT-based 
methods. The coordinates of CCL3L1 in the Sudmant et al. (2010) 
data were chrl7:34623842-34625730 (hgl9). 

COMPARISON WITH OTHER METHODS: DEFB103A 

The data sets of Sudmant et al. (2010) and Hardwick et al. (201 1) 
were intersected and 104 samples were obtained that overlapped 
with the 2535 1000 Genomes samples we analyzed using CNVrd2. 
The coordinates of the DEFB103A gene in the Sudmant et al. 
(2010) data were chr8:7738913-7740180 (hgl9). For CNVnator, 
we downloaded all of the chromosome 8 data for the 104 sam- 
ples and used a similar approach as for CCL3L1. For cn.Mops, 
we used CNVrd2 to obtain matrices of read counts in the 2 Mb 
region (chr8:6500000-8500000) for 2535 samples in eight differ- 
ent scenarios: constant windows of 25000, 20000, 10000, 5000, 
2000, 1000, 500, and 200 bp with defauh values for other parame- 
ters. The 10000 bp window was the only window length to call CN 
(six samples assigned CN = 3, the remainder assigned CN = 2) 
and was used in the Figure 3 comparison. 

LiftOver (Meyer et al, 2013) on the UCSC Genome browser 
was used to convert coordinates of microarray data and the 
Sudmant et al. (2010) data from Human Reference Genome hgl8 
to Human Reference Genome hgl9. 

RESULTS 
DATA 

Alignment results are presented in Figure 4. At CCL3L1 a total 
of 395,078,047 reads across all samples were aligned to a 2 Mb 
region around the gene (Chrl7:33670000-35670000). These reads 
had lengths ranging from 36 to 160 bp (median of 91.5 bp and 
the highest frequency, 44.9%, was 100 bp), and mapping qual- 
ities from 0 to 70 (median of 33.5 and the highest frequency, 
73.4%, was 60). The majority of reads (73.5% and 72.5%) align- 
ing to the CCL3Ll-containing region (chrl7:34617501-3465201) 
and CCL3L1 gene (Chrl7:34623842-34625730) had a mapping 
quality of 0, presumably reflecting multiple alignments to the 
paralogs within the locus. 

A total of 386,195,488 reads across all samples were aligned 
to the 2 Mb region on Chr 8 (Chr8:6500000-8500000) around 
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the defensin genes (Figure 4). These reads had lengths ranging 
from 36 to 160 bp (median of 91.5 bp and the highest frequency, 
45%, was 100 bp), and mapping qualities from 0 to 70 
(median of 33.5 and the highest frequency, 40.2%, was 60). 
The majority of reads (64.1 and 99%, respectively) aligning 
to the DEFB103A-containing gene (chr8:7641001-7742001) and 
DEPB103A region (Chr8:7738726-7740105) had a mapping qual- 
ity of 0. The average coverage of 2535 samples on the 2Mb 
region was between 2.8 and 30.9 x with a median of 6.9 x 
(Figure 4). 

CNVrd2: INFERRING RELIABLE VALUES FOR PARAMETERS 

We used eight loci derived from the data sets of two microarray 
approaches (Conrad et al., 2010; Campbell et al., 201 1) to obtain 
parameters of CNVrd2. CNVrd2 had good performance when the 
window size was 200 or 500 bp and a testThreshold2Merge value 
of 0.35 was used (Figures). A small window size (e.g., 100 bp) 
was not stable at all loci, possibly because there were multiple 
low-coverage samples in the 1000 Genomes Project data (e.g., 
464/2535 = 18.2% samples having coverage in the CCL3L1 gene- 
containing 2 Mb region < 5 x , Figure 4). This could lead to some 
windows having very low read counts, and thus being segmented 
incorrectly. For example, some windows in normal regions (i.e., 
CN = 2) would be segmented into deleted regions or some win- 
dows in duplicated regions would be segmented into normal 
regions. Similarly, large windows (e.g., 5000 bp) also generated 
unreliable results (Figure 5). 

CNVrd2: COMPARISON WITH CNVrd AT EIGHT TEST LOCI 

Our previous work was focused on loci having low copy-number 
ranges. Even though this present study is concentrated on loci 



with CN range from zero to >5 CN, we also compared with 
CNVrd (Nguyen et al, 2013) to assess the performance of the 
modified pipeline, especially in the steps requiring obtaining 
and adjusting segmentation scores for each sample. Loci used 
in this comparison had copy number ranging from 0 to 5, and 
their boundaries were known (Table 3 and Nguyen et al, 2013). 
Therefore, a simple normal mixture model was used in CNVrd2 
to cluster segmentation scores into different groups. CNVrd2 had 
more stable results than those of CNVrd at almost all window 
sizes (Figures 5, 6). Merging sub-regions and adjusting bound- 
aries (Step D and E, Figure 1) produced more reliable results 
for CNVrd2 at two complex regions (chrl7:34436344-34481815 
and chrl:110222301-110242933). At these two regions, CNVrd 
showed very low concordance with microarray-based approaches 
(Figure 6). 

CNVrd2: COMPARISON WITH THE ELEVEN LOCI USED IN CNVrd 

We also applied CNVrd2 to eleven CNV regions (including 
FCGR3A/3B) to which CN was assigned using CNVrd in our 
previous work (Nguyen et al., 2013). Better concordances were 
seen at five loci for CNVrd2 {CFHRl: 94.29/91.9%, RHD: 
99.05/98.57%, GSTMl: 95.24/93.81%, SMN2: 90.95/88.1%, 
UGT2B17: 99.05/97.14%) and the same concordances were 
seen at the other four loci (GSTTl: 94.76%, IGLL5: 95.24%, 
CNTNAP3: 81.43%, IGLL3P: 99.52%). For FCGR3A and 
FCGR3B, applying CNVrd2 to the same set of 952 samples in our 
previous work produced slightly higher concordances than for 
CNVrd when compared with PRT-based CN assignments (HoUox 
et al, 2009) (FCGRSA: 84.2/82% and fCGJ^3B:83.5/82.7% for 
CNVrd2 and CNVrd, respectively). 
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FIGURE 5 I Concordance between the CNVrd2 and microarray based results. The x-axis contains the values of testThreshold2Merge (0.15, 0.25, 0.35, 
0.45). The y-axis is the percentage of identical results. 



APPLICATION OF CNVrd2 TO TWO COMPLEX LOCI; CCL3L1 AND 
DEFB103A 

Information on read length, mapping quality, read depth and read 
count ratios is presented in Subjects and Methods ("HTS and 
microarray data used") and in Figure 4. 

Based on the results of the eight test loci (above. Figures 5, 6), 
the BAM files were processed using 500 bp windows for CCL3L1 
and 1000 bp windows for DEFB103A, and a testThreshold2Merge 
= 0.35. 1000 bp windows were chosen for DEFB103A because 
there was higher correlation between segmentation scores of 
DEFB103A and DEFB103B than for 500 bp windows (r = 0.80 
and 0.78, respectively). Owing to the small length of each gene 
(1888 bp for CCL3L1 and 1379 bp ioi DEFB103A) larger window 
sizes were not investigated as this could result in not detecting 
breakpoints nearby or inside the genes. 



POLYMORPHISM OF THE CCL3L1 LOCUS 

Using a 2 Mb region it took approximately 3 h to assign copy 
number at CCL3L1 on 2535 samples using a 4-core com- 
puter with 8 Gb memory. Standard deviations and different 
percentiles of the segmentation results of sub-regions on the 
2-Mb region were calculated to visually detect this region. Using 
a standard deviation threshold of 0.5, we identified CCL3L1 
within a large polymorphic CN variable region of approximately 
329 kb (chrl7:34436001-34815000, which included a 50 kb gap) 
(Figure?). This region included small CNV blocks of clearly 
defined increased standard deviation (SD) which suggested that 
there would be multiple recombination events inside the CNV 
region. CCL3 was outside this region, CCL4 was at the boundary, 
while CCL3L1 was in a second block with CCL3L3 and CCL4L1. 
The CCL3L1 gene was located in a sub-region which had SDs 
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fluctuating around 0.84 and 0.85. CCL3L3 and CCL4L1 were 
also in a sub-region with SDs fluctuating around 0.83 and 0.86, 
but these sub-regions were separated from the CCL3L1 gene- 
containing sub-regions by a decrease (0.8) in SDs between two 
blocks (Figure 7). We merged these sub-regions and the bound- 
aries of the CCL3L1 -containing region were chrl7:34617501- 
34652500. Using CNVrd2, we obtained segmentation scores for 
both the CCL3L1 gene-containing region and the larger CCL3L1 
region. The segmentation scores of the two regions were strongly 
correlated (r = 0.98, Figure 8). 

ASSIGNING CCL3L1 COPY NUMBER 

The European CCL3L1 segmentation results from the smaller 
35kb gene-containing region were assigned by CNVrd2 to five 
groups (Figure 9). All clusters were clearly delineated, although 



the highest group exhibited some scatter. The results obtained 
from the EM algorithm applied to the European samples were 
used as prior information to obtain CCL3L1 CN estimates for 
the 2535 samples from the 1000 Genomes Project. The final 
segmentation scores across all samples ranged from —1.66 to 
4.53 (Figiu'e9), with only one score being larger than 3.42. 
This value was considered to be an outlier, and the associ- 
ated sample (HG00620 in the CHS population) was assigned to 
the largest copy number group with a probability of 1. Thus, 
the segmentation score range used to determine the number of 
groups was between —1.66 and 3.42. The distances between the 
groups in the European population ranged between 0.39 and 0.62. 
As a result, 10 groups were chosen to encompass the range of 
CN values for the full collection of samples. The Markov Chain 
Monte Carlo chain was convergent after 10,000 iterations and the 
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FIGURE 7 I Plots of polymorphic regions encompassing CCL3L1 (A) 
and DEFB103A (B). The plots show standard deviation (top) and different 
percentiles (bottom) across 2 Mb sub-regions (for all 2535 samples). 



segmentation scores were clustered into 10 groups corresponding 
to CN from 0 to >9 (Figure 9). 

The CN assignments for European-ancestry populations 
ranged between 0 and 5, with the highest frequency (43.6%, 
220/505) being 2 copies (Table 1). Similar to the European- 
ancestry populations, CN = 2 was the most common (42.1%; 
208/494 samples) in the South Asian sample set. The East Asian 
sample set had CN between 0 and >9, with the most common 
being CN = 3 (36.7%, 189/515). CN = 3 was also the most fre- 
quent in the Americas sample set, with the highest CN = 8. The 
African sample sets had no individuals having 0 copies and just 



r = 0.98 r = 0.95 




-1 01234 -2 01234 

CCL3L1 SS DEFB103ASS 

FIGURE 8 I Correlation between segmentation scores of 
CCL3L1/DEFB103A and polymorphic regions encompassing the genes. 



10 samples (1.5%) having 1 copy, with 3-5 copies being the most 
common (Table 1). 

VALIDATION OF CCL3L1 CN ASSIGNMENTS WITH PARALOG RATIO 
TEST(PRT) DATA 

The CNVrd2 results on 180 of the samples were compared to 
data where CCL3L1 CN had previously been determined using 
PRT — 45 European samples measured by Carpenter et al. (2011) 
and 135 African and Asian samples measured by Janyakhantikul 
et al. (2010). There was 77.8% (140/180) identity between the two 
methods (Figure 3A). The majority of discordant results were for 
high CN samples and differences were of one CN. 

COMPARISON OF CNVrd2 WITH OTHER READ DEPTH-BASED 
METHODS AT CCL3L1 

The packages CNVnator (Abyzov et al., 2011) and cn.Mops 
(Klambauer et al., 2012) were compared to CNVrd2 (refer also to 
Subjects and Methods). In addition, we also used the 159 genome 
data set of Sudmant et al. (2010). Copy number for samples in 
this data set were measured using a read depth-based method, but 
the authors used the mrsFAST aligner (Hach et al., 2010) which 
obtains all possible positions of a read. The concordance rates for 
CNVnator, cn.Mops, and the approach of Sudmant et al. (2010) 
with CCL3L1 PRT-based results were 7.2% (13/180, r = 0.95), 
36.7% (66/180, r = 0.65), and 0% (0/111, r = 0.93) respectively 
(Figures). The highest CNs called by CNVnator and cn.Mops 
were only 5 and 4 respectively. CN assignments of Sudmant et al. 
(2010) were significantly higher than the modified PRT results, 
although assignments of the two methods were highly correlated 
(r = 0.93) (Figures). 

To investigate the discordant results of the other packages 
with the PRT-based method at CCL3L1, observed read-count 
ratios of European sample sets and all 2535 samples were calcu- 
lated for both the CCL3L1 gene region and CCI3Li -containing 
region (Figure 10). The two medians of these ratios were consid- 
erably less than 1 (0.7 and 0.8, respectively), while the expected 
median was 1. For the larger CCL3L J -containing region, 74.7% 
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(1893/2535) of samples had ratio <1, 76.4% (1938/2535) for 
the 35kb CCL3L1 region. Reads of samples having zero copies 
which were aligned to CCL3L1 gene region were very close to 
zero (Table 4), and thus the density of reads at CCL3L1 was not as 



high as expected. Approaches based solely on read depth with no 
cross-sample standardization would therefore have difficulty in 
accurately assigning CCL3L1 CN, especially for high-CN samples, 
which would be assigned to lower-CN groups. 
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Table 4 | Read counts at the CCL3L1 gene (chr17:34623842-34625730) 
(1.9 kb) of samples called 0 copies by CNVrd2. 

Sample Average Read count Read count Population 
ID coverage at CCL3L1 at CCL3L1- 

(2Mb) containing region 



The average coverage is calculated for the 2Mb region (chrl7:33670000- 
35670000). 

POLYMORPHISM OF THE DEFB103A LOCUS 

The DEFB103A region lies within a polymorphic inversion with 
complex structure (Sugawara et al., 2003). We used quantile 
values and standard deviations across sub-regions to visually 



identify the boundaries of the polymorphism encompassing the 
gene. Using a standard deviation threshold of 0.5, DEFB103A 
was identified within a large CN polymorphic region (consis- 
tent with Groth et al, 2008) of 1078 kb (chr8:70200000-7474000 
7532001-8098000 including a 58 kb gap) (Figure?). This CNV 
region also included small polymorphic blocks which suggested 
that there would be multiple recombination events inside the 
CNV region. DEFB103A was in the same block as DEFB104A 
and DEFB107A. The standard deviations of this block fluctu- 
ated around 0.70 and 0.74. We merged these sub-regions and 
the boundaries of the DEFB103A gene-containing region were 
chr8:764 100 1-7742000. The segmentation scores of this region 
and the larger DEFB103A gene region were strongly correlated 
(r = 0.95, Figures). We used this region to calculate CN for 
DEFB103A. The DEFB4A gene was outside this block owing to 
a relatively high peak (SD was 0.81 whilst the maximum SD of 
the DEFB103A gene-containing region was only 0.74) between 
DEFB4A and the DEFB103A block (Figure 7). 

We also calculated segmentation scores of DEFB103B (para- 
log of DEFB103A) to compare segmentation scores of this gene 
and those of the DEFB103A gene-containing region. The seg- 
mentation scores of the two regions were not as highly corre- 
lated (r = 0.83), but they made clear clusters of distinct groups 
(Figure 11). 

ASSIGNMENT OF DEFB103A CN 

The segmentation scores of the gene had clear clusters (Figure 9) 
with some scattered values >3, therefore these values were allo- 
cated into the largest (>3) group. The South Asian popula- 
tion had the clearest clusters therefore we used this population 
to obtain prior information and ran the EM algorithm upon 
the segmentation score with six groups. The distances between 
the groups were 0.50-0.69 therefore 0.57 was set as the prior 
distance. Eight groups were defined for all segmentation scores 
and CNVrd2 was used to assign DEFB103A copy- number using 
similar parameters as for CCL3L1 in the Bayesian clustering 
process (Table 2). Because the observed read count ratio range 
was reasonable (0.53-1.93) (Figure 11), this suggested that the 
lowest CN was greater than zero, with the scattering of values 
on the left (Figure 9) assigned to the smallest group {CN = 2) 
during the clustering process. A CN of 4 was the most common 
in all populations (Table 2; 26.3-47.9%), consistent with previous 
reports (Armour et al., 2007; Hardwick et al., 201 1). 

CNVrd2 obtained high concordance with PRT-based meth- 
ods (90.4%, r = 0.95, 94/104) (Figure 3). PRT CN assignments 
for 104 samples were available (Hardwick et al., 2011) that also 
overlapped with the samples analyzed by Sudmant et al. (2010). 
The concordances between Sudmant et al. (2010), CNVnator 
(Abyzov et al., 2011) and cn.mops (Klambauer et al., 2012) 
(using 10 kb windows for cn.mops) with PRT-based methods 
were 54.8% (57/104), 1% (1/104) and 4.8% (5/104) respectively 
The majority of differences between CNVrd2, Sudmant et al. 
(2010) and the PRT-based methods were one copy. 

DISCUSSION 

The CNVrd2 package, depicted in Figure 1, is an improved ver- 
sion of CNVrd previously used to assign CN at the FCGR locus 
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FIGURE 11 I Correlation between segmentation scores of the 
DEFB103A gene-containing region and the enlarged region 
encompassing DEFB4A (top left) and the DEFB103B gene 



region (bottom left). On the right are histograms of 
segmentation scores of the enlarged region (top) and DEFB103B 
region (bottom). 



(Nguyen et al., 2013). CNVrd2 was designed to correct issues that 
can occur during the segmentation process which can result in 
CNVrd assigning a segmentation score of zero, resulting in some 
samples being assigned to unreasonable CN groups. In addi- 
tion, in CNVrd2, a linear regression model was used to adjust 
the differentiation in CN between populations. Clustering data 
into different groups is challenging at complex loci with high 
ranges of CN. Therefore, a Bayesian clustering strategy was inte- 
grated into CNVrd2 in order to obtain more reliable results at 
CCL3L1 and DEFB 103 A. Finally, plots of percentiles and standard 
deviations of segmentation results were also added to CNVrd2. 
These can be used as visual tools to identify copy- number variable 
regions encompassing loci being measured (Figure?). In evalu- 
ating our results, it should be noted that CNVrd2 was developed 
on the FCGR and CCL3L1 loci. Therefore, the concordance com- 
parisons are biased toward CNVrd2. However DEFB103A can be 
regarded as a genuine test locus, on which CNVrd2 performed 
well (Figure 3). There is a paucity of loci for which published PRT 
gold-standard data are available on complex CN loci. 

CNVrd2 obtains higher concordance with PRT-based methods 
than other CN-assigning approaches. However the concordance 
was not 100% (Figure 3; 77.8% at CCL3L1 and 90.4% at 
DEFB103A), which would limit its use in case-control analysis of 
complex CN loci in common disease. We used BWA alignments 



from the 1000 Genomes Project, where the majority of mapping 
qualities were zero owing to reads having multiple hits within 
segmental duplications. Further improvement of CNVrd2 could 
come from the use of alignments better optimized to detect CNV, 
such as mrsFAST (Hach et al., 2010) which allows diverged par- 
alogs (1-2% difference) to be differentiated. It is important also 
to stress that CNVrd2 was developed to measure CN at speci- 
fied loci, while other comparison packages are more automated 
approaches that both detect CNV and assign CN. CNVrd2 is auto- 
mated to produce individual segmentation scores at a locus and 
assign CN using the Bayesian Information Criterion (Schwarz, 
1978). However it is better to manually undertake the clustering 
process as we did in this study. The reason is that clear clusters 
were not seen at complex loci CCL3L1 and DEFB103A. Note that 
CNVrd2 is enhanced by some prior knowledge of CN. 

From the observed read count ratio analysis (Figure 10), it can 
be seen that if only the read-count information from one sam- 
ple is used to assign CN (as is done by CNVnator) the majority 
of samples at the CCL3L1 locus would likely be called as deletions 
because of the inherently low number of reads in that region, pos- 
sibly resulting from CCL3L1 reads mapping to paralogs nearby. 
The majority of populations had observed read-count ratios of 
less than 1 . This could also be the reason that the cn.Mops package 
did not call high CN for multiple samples (Figure 3). Complete 
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discordance was seen between the Sudmant et al. (2010) CCL3L1 
results obtained using multiple hits of a read and all other meth- 
ods (Figures). The results of Sudmant et al. (2010) had been 
validated by PCR-based methods with good correlation. It is 
important to note that PCR-based methods are unreliable in mea- 
suring complex gene CN (Walker et al., 2009; McKinney and 
Merriman, 2012) — any differences in amplification efficiency that 
alters the ratio between the test and reference genes wOl alter 
the apparent CN (Armour et al., 2007). One possible reason 
for the Sudmant et al. (2010) discordance could be the use of 
different clustering strategies, but it could also be due to using 
different alignment methods when there are >1 read-mapping 
possibilities: choosing a random position (CNVrd2) vs. using all 
possible positions for a read (Sudmant et al., 2010). We checked 
the only sample (NA11831) called CAT = 1 by Sudmant et al. 
(2010) and zero by CNVrd2, cn.Mops, CNVnator, and PRT-based 
methods. There was only one read aligned to CCL3L1 (Table 4), 
while the average coverage for this sample on the 2 Mb region 
was adequate (5.8 x). Similarly, checking all samples called as 
0 copies by CNVrd2, we also saw that the number of reads 
aligned to the CCL3L1 gene region were extremely low (Table 4). 
This suggests that the sample NA11831 was correctly assigned 
CN = 0 for CCL3L1 by the PRT methods and the BWA aligner 
based next-generation sequencing methods (CNVrd2, CNVnator, 
cn.Mops). 

At the beta-defensin locus, we identified a CNV region which 
had nearly equal standard deviations of sub-regions. This CNV 
region encompassed DEFB103A. The segmentation scores of this 
and the DEFB103B region (Figure 7; on the polymorphic section 
before the gap) were moderately correlated (r = 0.83) but they 
made clear clusters of different groups (Figure 11). This suggested 
that individual CNs of the two regions would be nearly identical. 
However, the identified region did not include the DEFB4A gene 
because we saw a high peak between the region and DEFB4A gene 
(Figure 7). We retested with different windows (500 and 2000 bp) 
and still observed this peak (data not shown). An enlarged region 
that included the peak and DEFB4A was strongly correlated with 
the smaller region (r = 0.95) but had multiple samples with a 
score of zero, because in CNVrd2 if a region was segmented into 
different sub-regions having different signs then the segmentation 
score of the region was assigned to zero. This suggested that there 
was not full duplication of the enlarged region (Figure 11). This 
situation could occur because of an error in the alignment pro- 
cess, or because there were heterogenous breakpoints within the 
enlarged region. 

At the complex CCL3L1 and DEFB103A loci CNVrd2 assigned 
CN with greater accuracy than the other read depth-based CN 
assignment methods CNVnator and cn.Mops. The CNVrd2 
package is implemented in R (as of version 3.0.1) and is avail- 
able from Bioconductor (as of version 2.14) at http://www. 
bioconductor.org/packages/release/bioc/html/CNVrd2.html, 
where installation and usage instructions can be found. 
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